Neighbourhood Exploitation in Hypertext Categorization

Authors

  • Houda Benbrahim
  • Max Bramer
Abstract

As the web expands exponentially, the need to bring some order to its content becomes apparent. Hypertext categorization, that is, the automatic classification of web documents into predefined classes, emerged to relieve humans of that task. The extra information available in a hypertext document poses new challenges for automatic categorization. HTML tags and the linked neighbourhood both provide rich information for hypertext categorization that is not available in traditional text classification. This paper looks at (i) which extra information hidden in HTML tags and in linked neighbourhood pages should be taken into consideration to improve the classification task, and (ii) how to deal with the high level of noise in linked pages. A hypertext dataset and four well-known learning algorithms (Naïve Bayes, k-Nearest Neighbour, Support Vector Machine and C4.5) were used to exploit the enriched text representation. The results showed that judicious use of the information in the linked neighbourhood and in HTML tags improved the accuracy of the classification algorithms.
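As a rough illustration of what an "enriched text representation" could look like, the sketch below builds a bag of words for a target page, gives extra weight to words in its `<title>` tag, and merges in words from linked pages only when they share vocabulary with the target. This is not the authors' method: the function names, the `title_weight` and `min_overlap` parameters, and the overlap-based noise filter are all assumptions made for this example.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class _TextExtractor(HTMLParser):
    """Collects page text, keeping <title> text separate from the rest."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []
        self.body_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        target = self.title_parts if self._in_title else self.body_parts
        target.append(data)


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def page_features(html, title_weight=2):
    """Bag of words for one page; title words count title_weight each (assumed value)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    features = Counter(tokenize(" ".join(parser.body_parts)))
    for token in tokenize(" ".join(parser.title_parts)):
        features[token] += title_weight
    return features


def enriched_features(target_html, neighbour_htmls, min_overlap=1):
    """Add neighbour-page words to the target's bag of words.

    A neighbour is treated as noise and skipped when it shares fewer than
    min_overlap distinct words with the target (a hypothetical filter).
    """
    target = page_features(target_html)
    enriched = Counter(target)
    for html in neighbour_htmls:
        neighbour = page_features(html)
        if len(set(target) & set(neighbour)) >= min_overlap:
            enriched.update(neighbour)
    return enriched


target_page = (
    "<html><head><title>Machine Learning</title></head>"
    "<body>Course on learning algorithms</body></html>"
)
neighbour_pages = [
    "<html><body>Learning resources and lecture notes</body></html>",
    "<html><body>Buy cheap tickets</body></html>",  # unrelated, filtered out
]
features = enriched_features(target_page, neighbour_pages)
```

The resulting counts could then be passed to any of the classifiers named in the abstract. A realistic filter would likely need stop-word removal and a stronger similarity measure than raw word overlap.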

Similar resources

Classification Techniques for Categorization of Hypertext Documents

In this paper we investigate techniques for categorization of hypertext documents. Recent years have witnessed a growing interest in applying text categorization techniques to the Web. However, the semi-structured nature of the Web along with diverse subject matter present in it pose interesting challenges for conventional classification techniques. In this paper, we review some of the techniqu...

DHCS: A Case of Knowledge Share in Cooperative Computing Environment

Large-scale hypertext categorization has become one of the key techniques in web-based information acquisition. How to implement efficient hypertext categorization is still an ongoing research issue. This paper introduces the Distributed Hypertext Categorization System (DHCS), in which the Directed Acyclic Graph Support Vector Machines (DAGSVM) for learning multiclass hypertext classifiers is i...

Towards Structure-sensitive Hypertext Categorization

Hypertext categorization is the task of automatically assigning category labels to hypertext units. Comparable to text categorization it stays in the area of function learning based on the bag-of-features approach. This scenario faces the problem of a many-to-many relation between websites and their hidden logical document structure. The paper argues that this relation is a prevalent characteri...

Impact on Performance of Hypertext Classification of Selective Rich HTML Capture

Hypertext categorization is the automatic classification of web documents into predefined classes. It poses new challenges for automatic categorization because of the rich information in a hypertext document. Hyperlinks, HTML tags, and metadata all provide rich information for hypertext categorization that is not available in traditional text classification. This paper looks at (i) what represe...

Text and Hypertext Categorization

Automatic categorization of text documents has become an important area of research in the last two decades, with features that make it significantly more difficult than the traditional classification tasks studied in machine learning. A more recent development is the need to classify hypertext documents, most notably web pages. These have features that add further complexity to the categorizat...

Publication date: 2004